What is Statistical Learning?

  • Statistical learning refers to a vast set of tools for understanding data

  • We can break these tools into two broad categories: Supervised and Unsupervised

  • We will use these tools to build statistical models that predict or estimate an output based on one or more inputs


Statistical Learning Problems

  • Identify the risk factors for prostate cancer



  • Predict whether someone will have a heart attack on the basis of demographic, diet, and clinical measurements



  • Establish the relationship between salary and demographic variables in population survey data



Supervised Learning

  • Outcome measurement: Y

    • dependent variable, response, target


  • Vector of p predictor measurements X

    • inputs, regressors, covariates, features, independent variables


  • In the regression problem, Y is quantitative (e.g., price, blood pressure, temperature)

  • In the classification problem, Y takes values in a finite, unordered set (e.g., survived/died, digit 0-9, cancer class of tissue sample)

  • We have training data \((x_1, y_1), \ldots, (x_N, y_N)\). Each pair is one observation of the predictor and outcome measurements
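
The regression and classification settings above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy data, the helper names fit_line and nearest_neighbor, and the choice of least squares and 1-nearest-neighbor as example methods. The notes themselves do not prescribe any particular technique.

```python
# Hypothetical training data: N = 5 observations, p = 1 predictor.
x = [1.0, 2.0, 3.0, 4.0, 5.0]

# Regression: the outcome Y is quantitative.
y_quant = [2.0, 4.0, 6.0, 8.0, 10.0]

def fit_line(x, y):
    """Least-squares fit of y = b0 + b1 * x for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = fit_line(x, y_quant)
y_hat = b0 + b1 * 6.0  # predicted quantitative outcome for a new input

# Classification: the outcome Y takes values in a finite, unordered set.
y_class = ["low", "low", "low", "high", "high"]

def nearest_neighbor(x, y, x_new):
    """Predict the label of the training observation closest to x_new."""
    i = min(range(len(x)), key=lambda j: abs(x[j] - x_new))
    return y[i]

label_hat = nearest_neighbor(x, y_class, 4.4)
```

In both cases the training pairs (x_i, y_i) drive the fit; only the type of Y, and therefore the kind of prediction returned, differs.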


Objectives

  • On the basis of the training data we would like to:

    • Accurately predict unseen test cases

    • Understand which inputs (independent variables) affect the outcome (dependent variable) and how

    • Assess the quality of our predictions and inferences
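
One common way to pursue the first and third objectives is to hold out test cases that the model never sees during fitting and measure prediction error on them. The sketch below assumes made-up data, a simple least-squares line, and mean squared error as the quality measure; the helper names fit_line and mse are hypothetical, and other models or error measures would serve equally well.

```python
def fit_line(x, y):
    """Least-squares fit of y = b0 + b1 * x for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted outcomes."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

# Hypothetical data, split into training cases and unseen test cases.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.1, 3.9, 6.2, 7.8]
test_x = [5.0, 6.0]
test_y = [10.1, 11.9]

# Fit on the training data only.
b0, b1 = fit_line(train_x, train_y)

# Assess prediction quality on the held-out test cases.
test_pred = [b0 + b1 * xi for xi in test_x]
test_mse = mse(test_y, test_pred)

# Baseline for comparison: always predict the training mean of y.
baseline = sum(train_y) / len(train_y)
baseline_mse = mse(test_y, [baseline] * len(test_y))
```

Comparing test_mse with baseline_mse shows whether the input actually helps predict the outcome, and the fitted slope b1 gives a rough indication of how the input affects it.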


Philosophy

  • It is important to understand the ideas behind the various techniques, in order to know how and when to use them

  • One has to understand the simpler methods first, in order to grasp the more sophisticated ones

  • It is important to accurately assess the performance of a method, to know how well or how badly it is working [simpler methods often perform as well as fancier ones!]

  • This is an exciting research area, having important applications in science, industry and finance

  • Statistical learning is a fundamental ingredient in the training of a modern data scientist


Unsupervised Learning

  • No outcome (dependent) variable

    • Just a set of predictors (independent variables, features, etc.) measured on a set of samples


  • The objective is fuzzier

    • Find groups of samples that behave similarly

    • Find features (predictors) that behave similarly

    • Find linear combinations of features (predictors, independent variables) with the most variation


  • Difficult to know how well you are doing

  • Different from supervised learning, but can be useful as a pre-processing step for supervised learning
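
As an illustration of finding groups of samples that behave similarly, here is a minimal sketch of k-means clustering on a single feature. The data, the starting centers, and the function name kmeans_1d are all assumptions chosen for illustration; k-means is only one of many possible grouping methods.

```python
def kmeans_1d(values, centers, n_iter=10):
    """Basic k-means on one feature: assign each value to its nearest
    center, then move each center to the mean of its assigned values."""
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            groups[nearest].append(v)
        # Keep the old center if no value was assigned to it.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    labels = [
        min(range(len(centers)), key=lambda k: abs(v - centers[k]))
        for v in values
    ]
    return centers, labels

# Hypothetical measurements of one feature on six samples; no outcome Y.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

# Starting centers are an arbitrary choice for this sketch.
centers, labels = kmeans_1d(values, centers=[0.0, 10.0])
```

Because there is no outcome variable, nothing in the data says whether these groups are "correct", which is why it is hard to know how well the method is doing. The resulting labels could, however, be passed to a later supervised model as an extra feature.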